HDI Aspects:
HDI consequences and implications:
Course of action/recommendation - More educators, jobs and medical specialists to improve HDI
Human Development Policy
Extra Sources on HDI
Importing libraries
library(dplyr)
library(readxl)
library(tidygeocoder)
library(sf)
library(mapview)
library(RColorBrewer)
library(plotly)
Importing data
data <- read_excel("geo_NCdata.xlsx")
hdi_data <- select(data, c("City", "Education", "Income", "Occupation",
"Health Status", "Housing",
"latitude", "longitude"))
head(hdi_data)
## # A tibble: 6 x 8
## City Education Income Occupation `Health Status` Housing latitude longitude
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Aggene~ 33.1 3.30 60.1 94.0 99.9 -29.2 18.8
## 2 Alexan~ 30.4 8.91 44.4 93.9 100. -28.6 16.5
## 3 Askham~ 13.2 23.3 27.0 93.8 100. -27.0 20.8
## 4 Augrab~ 16.5 25.5 29.3 93.1 97.8 -28.5 20.1
## 5 Barkly~ 28.4 27.1 38.6 91.0 84.2 -28.5 24.5
## 6 Brandv~ 13.4 14.9 43.4 95.4 99.9 -30.5 20.5
Education: Primary education (% > 20 years old with primary education only), Matric pass rate (% Matric pass rate 2017)
Income: Average per capita income (personal income), Population living below breadline (% population living below national mean level of living in 2011), Social grant dependency (% population receiving social grants)
Occupation: Unskilled workers (% of unskilled workers)
Health Status: HIV/AIDS status (% Population with HIV/AIDS)
Housing: Informal housing (% population living in informal housing units)
fig <- hdi_data %>%
plot_ly(
y = ~Education,
type = 'violin',
box = list(visible = TRUE),
meanline = list(visible = TRUE),
x0 = 'Education')
fig <- fig %>%
layout(
title = "Distribution of Education",
yaxis = list(title = "%", zeroline = FALSE))
fig
Cities/Towns that are not geocoded
hdi_data[rowSums(is.na(hdi_data)) > 0,]$City
## [1] "Delpoortshoop, Northern Cape" "Olynvenhoutsdrif, Northern Cape"
## [3] "Phillipstown, Northern Cape" "Soverby, Northern Cape"
Removing Cities/Towns that are not geocoded
locations_hdi <- subset(hdi_data, !is.na(hdi_data$longitude) & !is.na(hdi_data$latitude))
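As a minimal sketch of the two steps above, the toy table below shows how `rowSums(is.na(...)) > 0` flags rows with any missing value and how `subset()` keeps only rows that have both coordinates. The towns and values are made up for illustration and are not rows from geo_NCdata.xlsx.

```r
# Toy data: hypothetical towns, not taken from geo_NCdata.xlsx
toy <- data.frame(
  City      = c("Town A", "Town B", "Town C"),
  Education = c(20.5, 31.0, 15.2),
  latitude  = c(-29.1, NA, -28.4),
  longitude = c(18.9, NA, 21.3)
)

# Rows with at least one missing value (here: a failed geocoding lookup)
not_geocoded <- as.character(toy[rowSums(is.na(toy)) > 0, ]$City)

# Keep only rows where both coordinates are present
toy_located <- subset(toy, !is.na(longitude) & !is.na(latitude))

print(not_geocoded)
print(nrow(toy_located))
```

Note that `rowSums(is.na(...))` flags a row if any column is missing, while the `subset()` call only checks the two coordinate columns. The two agree here only because the coordinates are the only columns with missing values.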
Clustering is a broad set of techniques for finding subgroups of observations within a data set. When we cluster observations, we want observations in the same group to be similar and observations in different groups to be dissimilar. Because there is no response variable, clustering is an unsupervised method: it looks for structure among the n observations without being trained on a known outcome. This lets us identify which observations are alike and group them accordingly. K-means is one of the simplest and most commonly used methods for splitting a dataset into k groups. In this case, clustering will aid in finding Cities/Towns with similar Human Development Index profiles.
k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype (centroid) of the cluster.
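The "nearest mean" idea can be checked by hand. The sketch below runs base R's `kmeans()` on simulated toy points (not the HDI data), then recomputes two things from the fitted centroids: which centroid is nearest to each point, and the total within-cluster sum of squares that k-means tries to make small.

```r
set.seed(1)
# Toy data: two well-separated groups of 2-D points (illustrative only)
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2)
)

fit <- kmeans(x, centers = 2, nstart = 10)

# Squared Euclidean distance from every point to every centroid
# (one column per centroid)
sq_dist <- sapply(seq_len(nrow(fit$centers)), function(j) {
  rowSums(sweep(x, 2, fit$centers[j, ])^2)
})

# Index of the nearest centroid for each point
nearest <- max.col(-sq_dist, ties.method = "first")

# Total within-cluster sum of squares, computed by hand
wss_manual <- sum(sq_dist[cbind(seq_len(nrow(x)), fit$cluster)])

print(all(nearest == fit$cluster))
print(all.equal(wss_manual, fit$tot.withinss))
```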
Clustering is the process of grouping data objects using a similarity measure.
Clustering can be hierarchical or partitional, exclusive, overlapping or fuzzy, and complete or partial.
K-Means is a partitional clustering technique; data objects are divided into non-overlapping groups.
K-Means is a prototype-based clustering technique.
A prototype-based cluster is represented by a prototype such that all members within a cluster are close to the corresponding prototype.
Centroid and medoid are two commonly used prototypes.
K-Means clustering learns properties of a set of data points and forms partitions, called clusters, that represent data with similar properties. For continuous data, each cluster is represented by its centroid, which is the mean of the cluster members.
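To make the difference between the two prototypes concrete, here is a small sketch on a made-up cluster of five 2-D points, one of which lies far from the rest. The centroid is the coordinate-wise mean and need not be an actual member. The medoid is taken here as the member with the smallest total Euclidean distance to the other members.

```r
# Toy cluster: five hypothetical 2-D members, one of them far from the rest
members <- rbind(
  c(1, 1),
  c(2, 1),
  c(1, 2),
  c(2, 2),
  c(9, 9)
)

# Centroid: the coordinate-wise mean of the members
centroid <- colMeans(members)

# Medoid: the member with the smallest total distance to all other members
pairwise <- as.matrix(dist(members))
medoid_index <- unname(which.min(rowSums(pairwise)))
medoid <- members[medoid_index, ]

print(centroid)  # pulled towards the distant point
print(medoid)    # always one of the original members
```

In this toy cluster the centroid lands at (3, 3), which is not a member, while the medoid is the member (2, 2). K-means uses the centroid, which is why it suits continuous variables such as the percentages analysed here.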
locations_hdi_scale <- scale(select(locations_hdi,
c("Education", "Income", "Occupation",
"Health Status", "Housing")))
#hopkins(locations_hdi_scale, n = nrow(locations_hdi_scale)-1)
library(factoextra)
fviz_nbclust(locations_hdi_scale, kmeans, method = "wss")
fviz_nbclust(locations_hdi_scale, kmeans, method = "silhouette")
set.seed(123)
locations_hdi_cluster <- kmeans(locations_hdi_scale,
centers = 2, nstart = 25)
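The elbow approach requested with `method = "wss"` rests on the total within-cluster sum of squares at each candidate k. A minimal base R sketch of that computation on simulated toy data follows. The three simulated groups and the range `k_values` are assumptions for illustration; the real analysis would pass `locations_hdi_scale` instead.

```r
set.seed(123)
# Toy data with three hypothetical groups (illustrative only)
x <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.4), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.4), ncol = 2)
)
# Standardise each column, as done for the HDI variables
x <- scale(x)

# Total within-cluster sum of squares for k = 1, ..., 6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which adding clusters reduces WSS only slightly
print(data.frame(k = k_values, wss = round(wss, 2)))
```

Because the columns are standardised, the value at k = 1 equals (n - 1) times the number of columns, which gives a quick sanity check on the computation.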
library(ggplot2)
library(plotly)
ggplotly(fviz_cluster(locations_hdi_cluster, data = locations_hdi_scale) +
theme_minimal() +
theme(legend.position = "none") +
ggtitle("Human Development Index Clusters (Groups)"))
Adding the clusters to the Human Development Data Frame
locations_hdi$Cluster <- as.factor(locations_hdi_cluster$cluster)
head(locations_hdi)
## # A tibble: 6 x 9
## City Education Income Occupation `Health Status` Housing latitude longitude
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Aggene~ 33.1 3.30 60.1 94.0 99.9 -29.2 18.8
## 2 Alexan~ 30.4 8.91 44.4 93.9 100. -28.6 16.5
## 3 Askham~ 13.2 23.3 27.0 93.8 100. -27.0 20.8
## 4 Augrab~ 16.5 25.5 29.3 93.1 97.8 -28.5 20.1
## 5 Barkly~ 28.4 27.1 38.6 91.0 84.2 -28.5 24.5
## 6 Brandv~ 13.4 14.9 43.4 95.4 99.9 -30.5 20.5
## # ... with 1 more variable: Cluster <fct>
hdi_clust <- select(locations_hdi, c("Education", "Income", "Occupation",
"Health Status", "Housing"))
hdi_clust_table <- aggregate(hdi_clust,
by = list(cluster = locations_hdi_cluster$cluster),
FUN = mean)
hdi_clust_table
## cluster Education Income Occupation Health Status Housing
## 1 1 22.06866 20.69253 34.81301 94.06322 97.42341
## 2 2 32.32615 27.58929 42.14865 91.40413 75.68210
ggplotly(locations_hdi %>%
group_by(Cluster) %>%
summarise(No_of_Cities = n()) %>%
arrange(No_of_Cities) %>%
mutate(Cluster = factor(Cluster, levels = unique(Cluster))) %>%
ggplot(aes(x = Cluster, y = No_of_Cities)) +
geom_bar(stat = "identity",
fill = "#1f77b4") +
geom_text(aes(label = No_of_Cities),
vjust = -0.25) +
coord_flip() +
labs(x = "Cluster",
y = "Number of Cities/Towns",
title = "Human Development Grouping (Clusters)") +
theme_minimal())
Human_Development <- st_as_sf(locations_hdi, coords = c("longitude", "latitude"), crs = 4326)
mapview(Human_Development,
zcol = "Cluster")